Abstract

The matching function for the problem of stereo
reconstruction or optical flow has traditionally been
designed as a function of the distance between the features
describing the matched pixels. This approach works under the
assumption that the appearance of pixels in two stereo
cameras or in two consecutive video frames does not change
dramatically. However, this might not be the case if we try
to match pixels over a large interval of time.

In this paper we propose a method that learns the matching
function and automatically finds the space of allowed
changes in visual appearance, such as those due to motion
blur, chromatic distortions, different colour calibration
or seasonal changes. Furthermore, it automatically learns
the importance of matching scores of contextual features at
different relative locations and scales. The proposed
classifier gives reliable estimates of pixel disparities
even without any form of regularization.

We evaluated our method on two standard problems, stereo
matching on the KITTI outdoor dataset and optical flow on
the Sintel dataset, and on the newly introduced TimeLapse
change detection dataset. Our algorithm obtained very
promising results comparable to the state-of-the-art.

1. Introduction 

Matching problems, such as dense stereo reconstruction or
optical flow, typically use the distance between patch
features surrounding the pixels as the underlying similarity
measure between pixels. In many scenarios, for example when
high quality rectified images are available, this approach
is sufficient to find reliable pixel correspondences between
images. To get smooth results, regularization in a discrete
or continuous CRF framework is often utilized. Research in
this field has focused on finding the most robust features,
designing the best distance measure, choosing the right form
of regularization and developing fast and reliable
optimization methods for the corresponding optimization
problem. An exhaustive overview of matching algorithms can
be found in [?] and a comparison of different dissimilarity
measures can be found in [?]. The proposed methods are fully
generative and do not require any form of discriminative
training, except for the weight of the regularization, which
can be hand-tuned without much effort.

The main problem of such approaches occurs if the scene
contains large texture-less regions. In that case each pixel
matches any other pixel and the result is determined solely
by the regularization, which typically biases the solution
towards a constant labelling. This causes severe problems
for large planar surfaces, such as walls or the ground
plane. For many datasets, this is typically resolved using
special dataset-dependent priors, for example using the
known height of the camera above the ground plane for the
KITTI dataset [?]. To get state-of-the-art results it is
necessary to over-engineer the method for each particular
dataset. For the optical flow problem, this includes
heuristics such as solving the problem in a coarse-to-fine
fashion, constraining the allowed range of disparities or
flows based on matched sparse features [?] or recursive deep
matching in a spatial pyramid [5]. Such approaches lead to a
very large quantitative boost in performance; however, they
often miss thin structures, due to the inability of the
coarse-to-fine procedure to recover from an incorrect label
at a higher level of the pyramid.

To find the right matching function, researchers typically
focus on the search for the most robust feature. Such
solutions are often sufficient, because the appearance of
pixels in two stereo cameras or in two consecutive video
frames does not change dramatically. However, this is not
the case when we try to decide what has changed in the scene
within a large interval of time (such as half a year).
Depending on the application, we might be interested in
which buildings have been built or whether a ship was
present in the port. The importance of temporal change
detection has been recognized [?] for projects such as
Google Street View, with the ultimate goal of building
up-to-date 3D models of cities while minimizing the costs of
the updates. The problem of pixel-wise temporal change
detection can also be cast as a matching problem; however,
the range of transformations to which the matching function
has to be invariant increases significantly. For example,
for city scenes the matching function has to allow for
visual changes not only due to different lighting
conditions, but also due to seasonal changes, such as the
change of appearance of the leaves of a tree during the year
or the presence of snow on the ground.

In this paper we propose a method that learns a matching
function and automatically finds the space of allowed
changes in visual appearance, such as those due to motion
blur, chromatic aberrations, different colour calibration or
seasonal changes. We evaluated our method on pixel-wise
temporal change detection, for which we introduce a new
TimeLapse change detection dataset. The dataset contains
pairs of images taken at different times of the year from
the same spot, and human-labelled ground truth determining
what has changed in the scene, for example new buildings,
tents, vehicles or construction sites. We also validated our
method on two standard matching problems, stereo matching
and optical flow, on two standard datasets, KITTI and
Sintel [7], where we obtained results comparable to the
state-of-the-art. The resulting classifier typically obtains
smooth results even without any form of regularization,
unlike other approaches, where this step is crucial to
remove noise and handle texture-less regions. A similar idea
has been independently developed at the same time in [?],
where the stereo matching cost is learnt using Convolutional
Neural Networks.

2. Designing the matching classifier 

First we show that the problems of stereo matching, optical
flow and pixel-wise change detection are all conceptually
the same. For stereo matching we would like to design a
classifier that predicts, for each pixel in an input image,
the likelihood that it matches the pixel in the reference
image shifted by a disparity. A properly designed classifier
needs to be translation covariant: if we shift the reference
image to either side by any number of pixels, the estimated
disparities should change by the same number of pixels.
Intuitively, this property can be satisfied by a binary
classifier, which predicts whether two candidate pixels
match or not. Even though the original stereo problem is
multi-label, the symmetries of the classifier reduce the
learning to a much simpler 2-label problem. The same
conclusion can be drawn for the optical flow problem; the
classifier should be translation covariant in both the x and
y dimensions, and thus should be modelled as a binary
classifier. For pixel-wise change detection the images are
already aligned and the algorithm should only determine
whether each pair of corresponding pixels matches or not,
which leads to the same binary problem with a much larger
range of changes in visual appearance.

Let $I_1$ be the reference image and $I_2$ the image we
match. As we already concluded, our goal is to learn a
classifier $H(I_1, I_2, x_1, x_2)$ that predicts whether the
pixel $x_2 \in I_2$ matches the pixel $x_1 \in I_1$ in the
reference image. In a standard feature matching approach it
is typically insufficient to match only the individual pixel
colours, and thus features evaluated over a local
neighbourhood (patch) are typically matched instead. The
contextual range (size of the patch) required to
successfully match pixels is typically dataset-dependent;
too small a patch might contain insufficient information and
in too large a patch all important information might get
lost. For our classifier, we would like to learn the size of
the context automatically. If the range of the context was
determined by a rectangle $r$ centered at the pixels
$x_1 \in I_1$ and $x_2 \in I_2$, under ideal conditions the
feature representations evaluated over each rectangle,
$\Phi_r(I_1, x_1)$ and $\Phi_r(I_2, x_2)$, would be nearly
identical, i.e. the difference of the feature
representations would be close to a 0-vector. In the
presence of an unknown consistent systematic error, such as
the second camera being consistently darker, or the first
picture always being taken in summer and the second in
winter, this difference follows certain patterns, which
should be learnable from the data. Thus, we want to model
the matching classifier as a function of the difference of
feature representations
$\Phi_r(I_1, x_1) - \Phi_r(I_2, x_2)$. If the likelihood of
a systematic error is the same for both images, we can
alternatively use the absolute difference. The confidence of
the classifier should significantly increase if we use
contextual information over multiple scales, i.e. the
feature representations of multiple, not necessarily
centered, rectangles surrounding the pixel. During the
learning process, our classifier should determine the
importance of different scales and contextual ranges
automatically. Thus, the input vector will consist of the
concatenation of differences of feature representations:

$$\Phi(I_1, I_2, x_1, x_2) = \operatorname{concat}_{r \in R}\left(\Phi_r(I_1, x_1) - \Phi_r(I_2, x_2)\right), \qquad (1)$$

where concat(.) is the concatenation operator and $R$ is the
fixed set of randomly sampled rectangles surrounding the
pixels being matched.
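
To make Eq. (1) concrete, the following minimal Python
sketch builds the concatenated difference vector for one
pair of candidate pixels. It is an illustration only: the
helper names rect_mean and matching_feature, the
(dx0, dy0, dx1, dy1) rectangle convention and the use of a
plain rectangle average as $\Phi_r$ are assumptions made for
this example, not a description of the actual
implementation.

```python
import numpy as np

def rect_mean(feat, x, y, rect):
    """Average feature response over a rectangle placed relative to (x, y).

    feat: H x W x D array of per-pixel features.
    rect: (dx0, dy0, dx1, dy1) half-open offsets (hypothetical convention).
    The rectangle is clipped to the image borders.
    """
    h, w, dim = feat.shape
    x0, x1 = np.clip([x + rect[0], x + rect[2]], 0, w)
    y0, y1 = np.clip([y + rect[1], y + rect[3]], 0, h)
    if x1 <= x0 or y1 <= y0:
        return np.zeros(dim)
    return feat[y0:y1, x0:x1].reshape(-1, dim).mean(axis=0)

def matching_feature(feat1, feat2, p1, p2, rects):
    """Eq. (1): concatenate Phi_r(I1, x1) - Phi_r(I2, x2) over all r in R."""
    diffs = [rect_mean(feat1, p1[0], p1[1], r) - rect_mean(feat2, p2[0], p2[1], r)
             for r in rects]
    return np.concatenate(diffs)
```

For a correctly matched pair under ideal conditions the
returned vector is close to zero, whereas for a wrong pair
it generally is not; the classifier learns which non-zero
patterns are still compatible with a match.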

2.1. Related approaches 

The contextual matching feature representation for learning
the matching function is largely motivated by the successful
application of similar representations [?] to the semantic
segmentation problem. For this particular task, direct
concatenation of the feature vectors of the surrounding
pixels [?] or concatenation of the feature representations
of the rectangles [?] surrounding the pixel have previously
been applied successfully. These restricted forms of
representation are used mainly because they can be
efficiently evaluated, either by a direct look-up or using
integral images. The main difference between semantic
classifiers and our matching classifier is that a semantic
classifier finds the mapping between an image patch and a
semantic label, whereas for the matching problem we learn
the space of possible visual appearance changes that the
matching function should be invariant to.

2.2. Feature representations for stereo matching 

The most common rectangle representation for a semantic
classifier is the bag-of-words representation [?]. Each
individual dimension of such a representation can be
calculated using integral images, built for each individual
visual word. This leads to a very large memory consumption
during training, and thus requires sub-sampling of the
integral images (typically by a factor of 4 x 4 to 10 x 10,
depending on the amount of data). Sub-sampling by a factor k
restricts the locations and sizes of the rectangles to
multiples of k. This is not a big problem for semantic
segmentation, where it leads to noisy object boundaries,
because these can be handled using an appropriate
regularization [?]. However, for stereo matching it would
limit our matching precision. Thus, we combine the
sub-sampled bag-of-words representation of each rectangle
with another representation, which contains the average
feature response of a low-dimensional non-clustered feature.
Each dimension for any rectangle can also be calculated
using integral images, but requires only one integral image
per dimension, and thus does not require sub-sampling.
Combining the robust and powerful but imprecise bag-of-words
representation with a rather weak but precise
average-feature representation allows us to train a
classifier that is both strong and smooth.
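
The constant-time rectangle evaluation that both
representations rely on can be sketched as follows. This is
a generic summed-area table for a single feature channel;
the function names and the half-open rectangle convention
are assumptions of this example.

```python
import numpy as np

def integral_image(channel):
    """Summed-area table with a zero first row and column,
    so that ii[y, x] equals the sum of channel[:y, :x]."""
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1))
    ii[1:, 1:] = channel.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_average(ii, x0, y0, x1, y1):
    """Average over the half-open rectangle [x0, x1) x [y0, y1)
    using four look-ups, independent of the rectangle size."""
    total = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
    return total / ((x1 - x0) * (y1 - y0))
```

One such table is needed per feature dimension, which is why
a 17-dimensional average-feature representation can be kept
at full resolution, while a bag-of-words representation with
one table per visual word has to be sub-sampled.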

2.3. Training the classifier 

We use a standard AdaBoost framework [?] to learn the binary
classifier $H(\Phi(I_1, x_1) - \Phi(I_2, x_2))$. Positive
training samples correspond to each matching pair of pixels.
Negatives are randomly sampled from the overwhelmingly
larger set of non-matching pairs. Positives and negatives
were each weighted by the inverse ratio of occurrence of
their class. Thus, training became independent of the number
of negatives used, once a sufficient number is reached. In
practice we used 50 x more negatives than positives. As weak
classifiers we used decision stumps, defined as:

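$$h(I_1, I_2, x_1, x_2) = a\,\delta\big(\Phi_r(I_1, x_1)_i - \Phi_r(I_2, x_2)_i > \theta\big) + b, \qquad (2)$$

where $\Phi_r(.)_i$ is the $i$-th dimension of a feature
representation over rectangle $r$, $\delta(.)$ equals 1 if
its argument holds and 0 otherwise, and $\theta$, $a$ and
$b$ are the parameters of the weak classifier. The final
classifier is defined as a sum of weak classifiers:

$$H(I_1, I_2, x_1, x_2) = \sum_{m=1}^{M} h_m(I_1, I_2, x_1, x_2), \qquad (3)$$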

where $h_m$ is the $m$-th weak classifier. The most
discriminant weak classifier minimizing the exponential loss
is found iteratively in each round of boosting by randomly
sampling dimensions of the feature representations
$\Phi_r(I_1, x_1) - \Phi_r(I_2, x_2)$, a brute-force search
for the optimal $\theta$ and a closed-form [?] evaluation of
the $a$ and $b$ parameters of the weak classifier. We refer
to [?] for more details. The high-dimensional feature
representations are not kept in memory, but rather evaluated
as needed by the classifier using the integral images.
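
One boosting round of this kind can be sketched as below.
The text above does not spell out the closed form for $a$
and $b$, so this sketch assumes a GentleBoost-style weighted
least-squares fit, in which $b$ and $a + b$ are the weighted
label means below and above the threshold. All function
names are hypothetical, the feature matrix X is held in
memory for simplicity, and both classes are assumed to be
present with strictly positive weights.

```python
import numpy as np

def class_balanced_weights(z):
    """Weight each sample by the inverse frequency of its class (z in {-1, +1})."""
    w = np.where(z > 0, 1.0 / np.sum(z > 0), 1.0 / np.sum(z < 0))
    return w / w.sum()

def fit_stump(v, z, w):
    """Fit h(v) = a * [v > theta] + b; returns (error, theta, a, b) or None."""
    best = None
    for theta in np.unique(v)[:-1]:          # brute-force threshold search
        above = v > theta
        b = np.sum(w[~above] * z[~above]) / np.sum(w[~above])
        a = np.sum(w[above] * z[above]) / np.sum(w[above]) - b
        err = np.sum(w * (z - (a * above + b)) ** 2)
        if best is None or err < best[0]:
            best = (err, theta, a, b)
    return best                              # None if v is constant

def boost(X, z, rounds, n_candidates, rng):
    """Boosting with stumps over randomly sampled feature dimensions."""
    w = class_balanced_weights(z)
    stumps = []
    for _ in range(rounds):
        dims = rng.choice(X.shape[1], size=min(n_candidates, X.shape[1]),
                          replace=False)
        fits = [(fit_stump(X[:, i], z, w), i) for i in dims]
        fits = [(f, i) for f, i in fits if f is not None]
        (err, theta, a, b), i = min(fits, key=lambda t: t[0][0])
        stumps.append((i, theta, a, b))
        # exponential-loss re-weighting of the training samples
        w = w * np.exp(-z * (a * (X[:, i] > theta) + b))
        w /= w.sum()
    return stumps

def predict(stumps, X):
    """Eq. (3): sum of the weak classifier responses."""
    return sum(a * (X[:, i] > theta) + b for i, theta, a, b in stumps)
```

With class-balanced weights the positives and the negatives
each carry half of the total weight, which is what makes the
result insensitive to the exact number of sampled negatives.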

The learning procedure automatically learns which features
are rare and significant, such that they must be matched
precisely, and which features are common, such that the
regions they belong to must be matched instead. Furthermore,
the learning algorithm determines the importance of
different scales and the space of appearance changes it
should be invariant to. Unlike in coarse-to-fine approaches,
the matching function is found jointly over all scales
together.

The prediction is done by evaluating the classifier for
each pair of matching candidates independently. For stereo
estimation the response for pixel $x$ and disparity $d$ is
$H(x, d) = H(I_1, I_2, x, x - (d, 0))$, for optical flow the
response for flow $f = (f_x, f_y)$ is
$H(x, f) = H(I_1, I_2, x, x + f)$ and for pixel-wise change
detection it is $H(x) = H(I_1, I_2, x, x)$.
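
As an illustration of how the binary classifier yields a
multi-label stereo result, the sketch below evaluates a
score for every candidate disparity and keeps the best one
(an unregularized winner-takes-all decision). The callable
score(x, y, d) stands in for $H(I_1, I_2, x, x - (d, 0))$;
its name, the function name and the max_disp parameter are
assumptions of this example.

```python
import numpy as np

def stereo_unary_wta(score, width, height, max_disp):
    """Pick, for each pixel, the disparity with the highest response."""
    disparity = np.zeros((height, width), dtype=int)
    for y in range(height):
        for x in range(width):
            # only disparities that keep x - d inside the image
            candidates = range(0, min(max_disp, x) + 1)
            responses = [score(x, y, d) for d in candidates]
            disparity[y, x] = int(np.argmax(responses))
    return disparity
```

For optical flow the same loop would run over a
two-dimensional set of candidate flows, and for change
detection only the single candidate $d = 0$ is evaluated and
thresholded.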

2.4. Regularization 

For stereo and optical flow problems we adapt the fully
connected CRF approach [?], minimizing the energy:

$$E(d) = -\sum_{i \in I_2} H(x_i, d_i) + \sum_{i, j \in I_2,\, i \neq j} \psi_{ij}(d_i, d_j), \qquad (4)$$

where $-H(x_i, d_i)$ is the unary potential for pixel $x_i$
and the pairwise potentials $\psi_{ij}(d_i, d_j)$ take the
form:

$$\psi_{ij}(d_i, d_j) = \mu_{ij}(d_i, d_j) \exp\left(-\frac{|C_i - C_j|^2}{2\sigma_{app}^2} - \frac{|x_i - x_j|^2}{2\sigma_{loc}^2}\right), \qquad (5)$$

where $C_i$ and $C_j$ are the colour (appearance) features
for pixels $i$ and $j$, $x_i$ and $x_j$ are their locations
and $\sigma_{app}^2$ and $\sigma_{loc}^2$ are the widths of
the appearance and location kernels. The label compatibility
function $\mu_{ij}(d_i, d_j)$ enforces local planarity and
takes the form:

$$\mu_{ij}(d_i, d_j) = \exp\left(-\frac{|d_i - d_j - p_i \cdot (x_j - x_i)|^2}{2\sigma_{pln}^2}\right), \qquad (6)$$

where $\sigma_{pln}^2$ is the width of the plane kernel and
$p_i = (p_i^x, p_i^y)$ are the coefficients of a fitted
plane, obtained using RANSAC by maximizing:

$$p_i = \arg\max_{p'} \sum_{j} \exp\left(-\frac{|C_i - C_j|^2}{2\sigma_{app}^2} - \frac{|x_i - x_j|^2}{2\sigma_{loc}^2}\right) \delta\big(|d_i - d_j - p' \cdot (x_j - x_i)| < \alpha\big), \qquad (7)$$

where $\alpha = 1$ px is the threshold and $\delta(.)$ is
the Kronecker delta function. In each iteration a local
plane is fitted for each pixel $x_i$ based on the current
marginal beliefs by solving the optimization problem (7)
using RANSAC. In the next step, the marginals are updated
for each pixel as described in [14]. We repeat this
procedure until convergence or until the maximum number of
allowed iterations is reached.
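
The plane-fitting step of Eq. (7) can be sketched for a
single pixel as follows. This is a simplified illustration
under stated assumptions: it works on point estimates of the
disparities rather than on marginal beliefs, takes the
kernel weights of Eq. (7) as a precomputed array, and uses a
hypothetical function name and a fixed number of trials.

```python
import numpy as np

def fit_local_plane(i, xs, ds, weights, alpha=1.0, trials=100, rng=None):
    """RANSAC estimate of p_i maximizing the weighted inlier count of Eq. (7).

    xs: N x 2 pixel locations, ds: N current disparity estimates,
    weights: N appearance/location kernel weights with respect to pixel i.
    """
    if rng is None:
        rng = np.random.default_rng()
    dx = xs - xs[i]                      # x_j - x_i
    dd = ds[i] - ds                      # d_i - d_j
    others = np.flatnonzero(np.arange(len(ds)) != i)
    best_p, best_score = np.zeros(2), -np.inf
    for _ in range(trials):
        j, k = rng.choice(others, size=2, replace=False)
        A = dx[[j, k]]
        if abs(np.linalg.det(A)) < 1e-9:  # the three pixels are collinear
            continue
        p = np.linalg.solve(A, dd[[j, k]])  # plane through i, j and k
        inliers = np.abs(dd - dx @ p) < alpha
        inliers[i] = False
        score = weights[inliers].sum()
        if score > best_score:
            best_p, best_score = p, score
    return best_p
```

The sign convention follows Eqs. (6) and (7): a pixel $j$
lies on the plane of pixel $i$ when
$d_i - d_j = p_i \cdot (x_j - x_i)$.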

2.5. Implementation details 

In our implementation of the classifier we used four
bag-of-words representations using texton [15], SIFT [?],
local quantized ternary patterns [17] and self-similarity
features [?]. Each bag-of-words representation for each
rectangle consisted of a 512-dimensional vector, encoding
the soft-weighted occurrence of each visual word in the
rectangle. The visual words (cluster centres) were obtained
using k-means clustering. The soft weights for the 8 nearest
visual words were calculated using a distance-based
exponential kernel [?]. Integral images for the bag-of-words
representations were sub-sampled by a factor of 4 x 4.
Additionally we used the average-feature representation,
consisting of 17-dimensional non-clustered texton features,
containing the average responses of 17 filter banks. The
final representation was a concatenation of 4 bag-of-words
and 1 average-feature representation, each over 200
rectangles (which differ between the two kinds of
representation due to their different sub-sampling). The
final feature vector, the input of the classifier, was
(4 x 512 + 17) x 200 = 413,000-dimensional. The boosted
classifier consisted of 5000 weak classifiers (decision
stumps), as in [?]. The difference of feature
representations can be discriminative only if the rectangles
in the left and right images are of the same size. This
could be a problem if the shifted rectangle is cropped
differently by the edge of an image. To avoid this, we
simply detect such a case and update the cropping of the
other rectangle accordingly. Prior to regularization, a
validation step of inverse matching from the reference frame
to the frame matched was performed to handle occluded
regions. Classifier responses for the pixels that did not
pass the validation step were ignored in the regularization.
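
The inverse-matching validation can be illustrated by a
standard forward-backward consistency check. The sketch
below is an assumption about the form of this step, not a
description of the exact test used: the function name, the
integer disparity maps and the tolerance tol are all
hypothetical.

```python
import numpy as np

def consistency_mask(disp_fwd, disp_bwd, tol=1):
    """True where forward and inverse matching agree.

    disp_fwd[y, x]: integer disparity matching pixel x to x - d in the other frame,
    disp_bwd[y, x]: integer disparity estimated for the other frame in the
                    inverse direction.
    """
    h, w = disp_fwd.shape
    ys, xs = np.mgrid[0:h, 0:w]
    target = xs - disp_fwd                   # matched column in the other frame
    inside = (target >= 0) & (target < w)
    back = disp_bwd[ys, np.clip(target, 0, w - 1)]
    return inside & (np.abs(disp_fwd - back) <= tol)
```

Pixels where the mask is False would be treated as occluded
or unreliable, and their unary responses left out of the
regularization.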

3. Experiments 

We evaluated the classifier on three matching problems: the
outdoor KITTI [?] stereo dataset, the Sintel optical flow
dataset and our new TimeLapse change detection dataset.

3.1. KITTI data set 

The KITTI dataset [?] consists of 194 pairs of training and
195 pairs of test images with a resolution of approximately
1226 x 370, with sparse disparity maps obtained by a
Velodyne laser scanner. All labelled points are in the
bottom 2/3 of an image. Qualitative results are shown in
Figure 1. The 3D point clouds obtained from the dense stereo
correspondences are shown in Figure 3. Figure 2 shows the
difference in the starting point between our classifier and
standard feature-matching approaches, which after suitable
regularization are also able to get close to the
state-of-the-art.

Our classifier uses colour-based features, and thus was
trained and evaluated on the pairs of images obtained by the
pair of colour cameras. The comparison on the evaluation
server is done on the grey cameras instead. To compare our
method to the state-of-the-art, we evaluated it on a
colourized version of the grey images [?]. The quality of
the output deteriorated slightly, partly due to the
imprecise colourization process. Quantitative comparisons
can be found in Table 1.
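
The threshold-based error reported in Table 1 can be
computed along the following lines. This is a minimal sketch
of a "ratio of pixels outside a threshold" measure; the
function name and the boolean valid mask (marking pixels
with ground truth) are assumptions of this example, and the
official evaluation server may differ in details.

```python
import numpy as np

def bad_pixel_ratio(disp_est, disp_gt, valid, threshold):
    """Percentage of valid pixels whose disparity error exceeds the threshold."""
    err = np.abs(disp_est - disp_gt)[valid]
    return 100.0 * np.mean(err > threshold)
```

Evaluating it with thresholds of 3 and 5 pixels, once on all
labelled pixels and once on non-occluded pixels only, gives
the four columns of the table.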

3.2. Sintel data set 

The Sintel dataset (clean version) consists of short clips
from the rendered open source movie Sintel. It contains 23
training and 12 test clips, each from 20 to 50 frames long.
Together there are 1041 training and 552 test pairs of
images with a resolution of 1024 x 436. Most pixel flows are
in the range (-160, 160) x (-160, 160). Due to memory
limitations, the evaluation of our classifier is not
feasible for such a large range of flows. Thus, we trained
and evaluated our method on images (and thus also flows)
sub-sampled by a factor of 4 x 4. Qualitative results are
shown in Figure 5. We could not compare quantitatively to
other methods on sub-sampled images, because the ground
truth labelling is not available. Our classifier showed the
potential to be used as a preprocessing step for another
optical flow method, either as an initialization for its
inference or as a way to restrict the allowed range of
possible flows for high resolution images in a
coarse-to-fine fashion.
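
A rough count of the candidate flows per pixel shows why
sub-sampling helps. The sketch assumes integer flows in the
closed range [-160, 160] on each axis, which is an
assumption of this example; the helper name is hypothetical.

```python
def num_flow_labels(max_abs_flow, step=1):
    """Number of integer flow candidates per pixel on a 2D grid."""
    per_axis = 2 * (max_abs_flow // step) + 1
    return per_axis ** 2

full_res = num_flow_labels(160)            # 321 * 321 candidates
sub_sampled = num_flow_labels(160, step=4)  # 81 * 81 candidates
```

Under this assumption the label space shrinks from 103,041
to 6,561 candidates per pixel, and the number of pixels
drops by a further factor of 16.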

3.3. TimeLapse data set 

In this paper we also introduce a new TimeLapse temporal
change detection dataset. It contains 134 pairs of training
and 50 pairs of test images with annotated pixel-wise ground
truth indicating structural change in the scene, such as new
buildings, tents, vehicles or construction sites. Lighting
or seasonal changes (snow, leaves on trees) are not labelled
as change. The dataset will be made available on
publication. Qualitative results of our classifier are shown
in Figure 4. Our classifier successfully managed to
distinguish between seasonal changes and structural changes
in a scene. Quantitatively, 93.3% of pixels were correctly
labelled, the average recall per class was 86.9% and the
average precision was 87.9%.
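
The three reported measures can be derived from a confusion
matrix as sketched below. The sketch assumes that "average
recall per class" and "average precision" are unweighted
means over the change and no-change classes; the function
name and the integer label encoding are assumptions of this
example.

```python
import numpy as np

def change_detection_metrics(pred, gt, n_classes=2):
    """Pixel accuracy, mean per-class recall and mean per-class precision."""
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    # rows index the ground truth class, columns the predicted class
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)
    tp = np.diag(conf).astype(float)
    accuracy = tp.sum() / conf.sum()
    recall = tp / conf.sum(axis=1)
    precision = tp / conf.sum(axis=0)
    return accuracy, recall.mean(), precision.mean()
```

Averaging over classes rather than over pixels keeps the
much rarer change class from being dominated by the
no-change class.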

4. Conclusions 

Our contextual matching classifier showed the potential to
be a promising direction of further research in this area:
the application of discriminative training to this task. We
validated our approach on three challenging problems: stereo
matching, optical flow and pixel-wise temporal change
detection. In further work we would like to test various
design choices to maximize the quantitative performance of
our classifier. This might give us some insight into how to
design a discriminatively trained classifier that
generalizes across different datasets.

Method            Result 3px   Result Occ 3px   Result 5px   Result Occ 5px
PCBP-SS [20]         3.40 %        4.72 %          2.18 %        3.15 %
PCBP [21]            4.04 %        5.37 %          2.64 %        3.64 %
wSGM [22]            4.97 %        6.18 %          3.25 %        4.11 %
ATGV [23]            5.02 %        6.88 %          3.33 %        5.01 %
iSGM [24]            5.11 %        7.15 %          3.13 %        5.02 %
ALTGV [25]           5.36 %        6.49 %          3.42 %        4.17 %
ELAS [26]            8.24 %        9.96 %          5.67 %        6.97 %
OCV-BM [27]         25.38 %       26.70 %         22.93 %       24.13 %
GCut + Occ [28]     33.49 %       34.73 %         27.39 %       28.62 %
Our Result           5.11 %        5.99 %          2.85 %        3.43 %

Table 1. Quantitative comparison of our method with
competing state-of-the-art approaches on the KITTI dataset
in terms of the ratio of pixels outside the 3 and 5 pixel
thresholds, and in terms of average disparity error per
pixel. All evaluations are done either on all pixels (Occ
columns) or excluding occluded pixels. Approaches [?] and
[?] used all 21 pairs of frames in the videos. Our method
gets comparable results to theirs, in particular for the 5px
threshold.
Figure 1. Qualitative results on the KITTI dataset. Our classifier could successfully match pixels for a wide range of depths. Most
mismatches are due to occlusion or occur for sky pixels, which were not used during training. Regularization further improved the results by
inducing smoothness and local planarity. Reconstructed 3D point clouds for the same images can be found in Figure 3.



Left image Right image Our Unary Result Feature Matching Result 

Figure 2. Comparison of our unary result with a standard feature matching classifier, obtained as an average of Sobel filter matching [?]
and Census transform [?]. The standard unary matching result is much noisier and fails on texture-less surfaces, such as white walls.
Figure 3. Reconstructed 3D point clouds displayed from a different viewpoint, obtained from the same pairs of images in the same order
as in Figures 1 and 2. The visually appealing synthetically rendered views give a much better idea of the quality of the estimated depth.



Figure 4. Qualitative results on the TimeLapse dataset. Our classifier managed to distinguish between structural changes and seasonal
changes. The appearance of the trees in the 1st and 6th images has changed dramatically; however, our classifier did not label it as a change.
Similarly, the snow in the background of the 2nd image and on the roof of the 3rd image was not labelled as change. On the other hand,
structural changes, such as a newly built building, have been consistently recognized as a change.
Reference image    Matched image    Unary result    Regularized result

Figure 5. Qualitative results on the Sintel dataset. For visualization we hand-picked pairs of images with a larger range of
displacement. The amplitude of the flow is normalized by the maximum flow in each individual image and colour-coded
using the flow chart displayed on the right. The unary result shown is after the inverse matching validation step, to
remove the dependency of the colour coding on noise. Black pixels correspond to pixels that did not pass the validation
test. The results suggest that our approach could be used as an initial step of a coarse-to-fine strategy for the optical
flow problem.